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Summary. Histograms are convenient non-parametric density estimators, which continue to 
be used ubiquitously. Summary quantities estimated from histogram-based probability density 
models depend on the choice of the number of bins. In this paper we introduce a straight- 
forward data-based method of determining the optimal number of bins in a uniform bin-width 
histogram. Using the Bayesian framework, we derive the posterior probability for the number of 
bins in a piecewise-constant density model given the data. The most probable solution is deter- 
mined naturally by a balance between the likelihood function, which increases with increasing 
number of bins, and the prior probability of the model, which decreases with increasing number 
of bins. We demonstrate how these results outperform several well-accepted rules for choosing 
bin sizes. In addition, we examine the effects of small sample sizes and digitized data. Last, 
we demonstrate that these results can be applied directly to multi-dimensional histograms. 

1. Introduction 

Histograms are used extensively as nonparametric density estimators both to visualize data 
and to obtain summary quantities, such as the entropy, of the underlying density. However 
in practice, the values of such summary quantities depend on the number of bins chosen for 
the histogram, which given the range of the data dictates the bin width. The idea is to choose 
a number of bins sufficiently large to capture the major features in the data while ignoring 
fine details due to 'random sampling fluctuations'. Several rules of thumb exist for deter- 
mining the number of bins, such as the beli ef that betwe en 5 -20 bins is usually adequate (for 
example, Matlab uses 10 bins as a default). IScott (1979|) and lFreedman and Diaconis ("1981^ 
derived formulas for the optimal bin width by minimizing the integrated mean squared error 
of the histogram model h(x) of the true underlying density / (x), 

L(h(x),f(x))= J (h(x)-f(x)) 2 . (1) 

For N data points, the optimal bin width v goes as aN" 1 / 3 , where a is a constant that 
depends on the form of the underlying distribution. Assuming that the data are normally 
distributed with a sample variance s gives a = 3.49s (Scott, 1979), and 

iWt = 3A9SN- 1 / 3 . (2) 

Given a fixed range R for the data, the number of bins M then goes as 
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Freedman and Diaconis report similar results, however they suggest choosing a to be twice 
the interquartile range of the data. While these appear to be useful estimates for unimodal 
densities similar to a Gaussian distribution, they are known to be suboptimal for multimodal 
densities. This is because they were derived assuming particular characteristics of the 
underlying density. In particular, the result by Freedman and Diaconis is not valid for some 
densities, such as the uniform density, since it derives from the assumption that the density 
/ s atisfies f f' 2 > 0. 

IStone (1984j) chooses to minimize L(h, f)~Jf 2 and obtains a rule where one chooses 
the bin width v to minimize 

K(v,M)= 1 -(^-^lf: ,A (4) 

v m=l ' 

where M is the number of bins and 7Tj are the bin probabilities. iRudemo (1983) obtains a 
similar rule by applying cross-validation techniques with a Kullback-Leibler risk function. 

We approach this problem from a different perspective. Since the underlying density 
is not known, it is not reasonable to use an optimization criterion that relies on the error 
between our density model and the true density. Instead, we consider the histogram to be a 
piecewise-constant model of the underlying probability density. Using Bayesian probability 
theory we derive a straightforward algorithm that computes the posterior probability of 
the number of bins for a given data set. This enables one to objectively select an optimal 
piecewise-constant model describing the density function from which the data were sampled. 



2. The Piecewise-Constant Density Model 

We begin by considering the histogram as a piecewise-constant model of the probability 
density function from which TV data points were sampled. This model has M bins with 
each bin having width v fe , where k is used to index the bins. We further assume that 
the bins have equal width v = v k for all k, and together they encompass an entire width 
V = Mv.] Each bin has a "height" h k , which is the constant probability density over the 
region of the bin. Integrating this constant probability density h k over the width of the bin 
Vk leads to a total probability mass of Tr k = h k Vk for the bin. This leads to the following 
piecewise-constant model h(x) of the unknown probability density function f{x) 

M 

h ( x ) = X! hk U ( x k-l,X,X k ), (5) 
fc=l 

where h k is the probability density of the k th bin with edges defined by x k -i and x k , and 
H(xk—i,x,Xk) is the boxcar function where 

{0 if x < x a 
1 if x a < x < x b (6) 
if Xb < x 

Our density model can be re-written in terms of the bin probabilities irk as 

M M 

h(x) = — y"Vfc U(x k -i,x,Xk). (7) 

k=l 

fFor a one-dimensional histogram, life is the width of the k th bin. In the case of a multi- 
dimensional histogram, this will be a multi-dimensional volume. 
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It is important to keep in mind that the piecewise-constant density model is not a histogram 
in the usual sense. We use the bin heights to represent the probability density and not 
the probability mass of the bin. Furthermore, we will show that the mean value of the 
probability mass of a bin is computed in a slightly different way. 

Given M bins and the normalization condition that the integral of the probability density 
equals unity, we arc left with M — 1 bin probabilities: 7Ti,7T2, . . . ,ttm—i, each describing 
the probability that samples will be drawn from each of the M bins. The normalization 
condition requires that ttm = 1— SfcLi 1 n k- F° r simplicity, we assume that the bin alignment 
is fixed so that extreme data points lie precisely at the center of the extreme bins of the 
histogram (that is, the smallest sample is at the center of the leftmost bin, and similarly for 
the largest sample). As we will show, this technique is easily extended to multi-dimensional 
densities of arbitrarily high dimension. 



2. 1. The Likelihood of the Piecewise-Constant Model 

The likelihood function is a probability density that when multiplied by dx describes prob- 
ability that a datum point d n is found to have a value in the infinitesimal range between x 
and x + dx. For d n to have a value of x occurring in the k th bin, the likelihood is simply 
the probability density in the region defined by the bin 

p(d n \w k ,M,I) = h k = ^ (8) 

where I represents our prior knowledge about the problem, which includes the range of the 
data and the bin alignment. For equal width bins, the likelihood density reduces to 

P (d n \TT k ,M,I) = — TTfe. (9) 

For N independently sampled data points, the joint likelihood is given by 

p(d\ K ,M,I)= (Jf) ^vtf (10) 

where d = {d\, cfc, ■ ■ ■ , d/v} and n = {7ti,7T2, ■ ■ ■ , ttm-i}- Equation (|10|l is data-dependent 
and describes the likelihood that the hypothesized piecewise-constant model accounts for 
the data. Individuals who recognize this as having the form of the multinomial distribution 
may be tempted to include its familiar normalization factor. However, it is important to 
note that this likelihood function is properly normalized as is, which we now demonstrate. 
For a single datum point d, the likelihood that it will take the value x is 



M 



p(d = x\n,M,I) = -VVfc Il(xk-i,x,Xk), 
v * — ' 



(11) 



fc=i 
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where we have written v = Multiplying the probability density by dx to get the 

probability and integrating over all possible values of x we have 

M 



/OO pOO JlJ 

dx p(d = x\tt, M, I) = / - ^ (12) 

-oo J — oo ^ 

= - ^ / da; 7r fc II(a;fe_i,x,Xfe) (13) 

W I i J— oo 



fe=l 



= -V7T fc t) (14) 
7) Z — ^ 



fc=l 



5>fc (15) 



fc=l 

= 1- (16) 

2.2. Tfte Pr/'or Probabilities 

For the prior probability of the number of bins, we assign a uniform density 

P (M|/) = {^ *1<M<C (17) 
v 1 ' ( otherwise 

where C is the maximum number of bins to be considered. This could reasonably be set to 
the range of the data divided by smallest non-zero distance between any two data points. 

We assign a non-informative prior for the bin parameters 7ri,7T2, . . . , ttm-i, the possible 
values of which lie within a simplex defined by the corners of an M-dimensional hypercube 
with unit side lengths 



Y(K) 

P(2L|M,J) = -^ 

1 V 2 / 



7T17T2 ' ' ' 7TM_i ( 1 - ^ TT 



M-l 



i=l 



-1/2 



(18) 



Equation 11811 is the Jeffreys's prior for the m ultinomial likelihood I|1U|) j Jeffreys. 196 it 
iBox and Tiao. 19921: iBerger and Bernardo. 19 92). and has the advantage in that it is also 
the conjugate prior to the multinomial likelihood. 



2.3. The Posterior Probability 

Using Bayes' Theorem, the posterior probability of the histogram model is proportional to 
the product of the priors and the likelihood 

p(K, M\d, I) cx p( K \I) p(M\I) p(d\TT, M, I). (19) 

Substituting ifTHjl. fTTjl . and l(T%|l gives the joint posterior probability for the piecewise- 
constant density model 

/ n/r\ N r(M.\ I, , / M_1 \»m-| 

P ( K , M\d, i) oc(^) i^^^ 2 " 2 ^..-irr 5 (i-E-0 . (2°) 
\ / ^(2) ^ i=i ' 
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where p(M\ I) is absorbed into the implicit proportionality constant with the understanding 
that we will only consider a reasonable range of bin numbers. 

The goal is to obtain the posterior probability for the number of bins M. To do this we 
integrate the joint posterior over all possible values of ir 1: tt 2 , ■ ■ ■ , t^m-i in the simplex. The 
expression we desire is written as a series of nested integrals over the M — 1 dimensional 
parameter space of bin probabilities 

p(M\d,I)oc[^) -^l / d^.T^ dn 2 n?->... (21) 



vj r(|) 



diiM-i 2 I 1 — 2, ni 



v i=l 

In order to write this more compactly, we first define 

ft! = 1 (22) 

ft2 = 1 — 7Tl 

a 3 = 1 — 7Tl - 7T2 

M-2 
k=l 

and note the recursion relation 

flfc = a-k-i ~ TTfc-i- (23) 
These definitions greatly simplify the sum in the last term as well as the limits of integration 

/ A/A N r(^-) r ai i r a2 i 

p(M\d,I)oc[^) -ill ^ d* x «?-* J a dn 2 ^... (24) 

O'M — X 1 

d-KM-l CIT" 5 ^/-! -TTM-l)""" 5 . 



To solve the set of nested integrals in Q21[l. consider the general integral 



4 = / aV fe ^ fe fc 5 (a fc -TTfc) 6 * (25) 



where bu G K + and > 1/2. This integral can be re- written as 



h = 4 fc /"* *n k C '(l-^y. (26) 



70e 

Setting Uk = — we have 
ak 



Ik = a b k " / dua n k k + *u n *-Hl-u) b » 



n 



J 

o*»+"*+i B (n fc + |,b fc + l) (27) 
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where B(-) is the Beta function with 

r(n fe + i)r(6 fc + i) 



B{n k + ^b k + l) = — f " -■ C-'N 

r( n k + \+b k + l 



To solve all of the integrals we rewrite a k in (|27[) using the recursion formula (|23[) 



By defining 



we find 



Ik = K-i - ir k ~i) bk+nk+1 i B{n k + i A + 1). (29) 



&JW-1 = n M - - (30) 
bk-i = b k + n k + ^ 



h = N-n l+ ™-\. (31) 



Finally, integrating (|24|l gives 



/ M\ N T(M.\ M ~ 1 i 
p(M\d,I)cx(y) l[B(n k + -,b k + l), (32) 

^ ' ^{2) k=l 

which can be simplified further by expanding the Beta functions using Ij28(l 

M\ N r(f) r(n 1 + i)r(6 1 + i) r(n 2 + i)r(6 2 + 1) 



p(M\d,T)<x [ — 



r(i) M r(m + | + 6i + i) r(n 2 + ± + 6 2 + i) 

r(n M -i + |)r(6M-i + 1) 



r(n M _i + I + 6 M -i + 1) 



(33) 



Using the recursion relation (I3U|) for the b k , we see that the general term T(b k + 1) in each 
numerator, except the last, cancels with the denominator in the following term. This leaves 

P (M\ d I) oc (M_) N im. U^Tink + h) (34) 

p[ '-' j \v) r (i)^ r( ni + &! + §)' 1 j 

where we have used <|3(J|) to observe that T(6m-i + 1) = T("a/ + 1/2)- Last, again using the 

M 3 

2 2 ' 



recursion relation in l|3l)fl we find that b\ ~ N — n\ + 4f — |, which results in our marginal 



M\ w r(f) nf =1 r(n fe + i) 



posterior probability 

V^/ r(i) M r(7v + f) 

The normalization of this posterior probability density depends on the actual data used. 
For this reason, we will work with the unnormalizcd posterior, and shall refer to it as the 
relative posterior. 
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In optimization problems, it is often easier to maximize the logarithm of the posterior 



logp(M|d,/) = TV log M + log r^— J - m iogr^-J - io g rl n+ — ) - CUh 



M / i\ 

+ ^> g r + - +K 



fc=i 



where K represents the sum of the volume term and the logarithm of the implicit propor- 
tionality constant. The optimal number of bins M is found by identifying the mode of the 
logarithm of the marginal posterior 

M = arg max{logp(M|d, I)}. (37) 

M 

Such a result is reassuring, since it is independent of the order in which the bins are counted. 

Many software packages are equipped to quickly compute the log of the gamma function. 

However, for more basic implementations, the following definitions from lAbramowitz and Stegun f 1972^1 
can be used for integer m . 

rn 1 

logr(m) = ^logfc (38) 

k=l 

( 1\ 1 m 

logT m+- = -log7r-nlog2 + J]log(2/c- 1) (39) 



k=l 



Equation (|36J) allows one to easily identify the number of bins M which optimize the poste- 
rior. We call this technique the optBINS algorithm and provide a Matlab code implemen- 
tation in Appendix 1. 



3. The Posterior Probability for the Bin Height 

In order to obtain the posterior probability for the probability mass of a particular bin, we 
begin with the joint posterior (|65l) and integrate over all the other bin probability masses. 
Since we can consider the bins in any order, the resulting expression is similar to the multiple 
nested integral in l|21|l except that the integral for one of the M — 1 bins is not performed. 
Treating the number of bins as a given, we can use the product rule to get 

where the numerator is given by l|65|l and the denominator by l|35jl. Since the bins can be 
treated in any order, we derive the marginal posterior for the first bin and generalize the 
result for the k th bin. The marginal posterior is 

(M \N r (f) 

p(m\d, M, I) = p^ 2 /) / dn 2 n? S / dn 3 ^ . . . (4 1. ) 



&M-1 







dlfM-l^M-1 2 («M-1 ~ 7TM-l)" M 3 
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Evaluating the integrals and substituting l|35|) into the denominator we get 
P^\d,M,I) = n£ 2 ^(n fc + ^ fc + l) , _ 

rifc=i-BK + i,6 fe + i) 

Cancelling terms and explicitly writing 6 l5 the marginal posterior for 7rj is 



r(iv + f ) 
r(ni + 4)r(jv-m + ^)" 1 



p(7ri|d,M,2) = - , lN ; ' 2 ' , M-T^ x Hl-nxr-"^— , (43) 



which can easily be verified to be normalized by integrating m over its entire possible range 
from to 1. Since the bins can be considered in any order, this is a general result for the 
k th bin 



p(^\d,M,I) = ^ + " } ^ ^d-^. '44' 
T(n k + s)r(JV - n fe H — ) 



2^ \" '** 1 2 
The mean bin probability mass can be found from its expectation 



r-l 

(7T fc > = / dit k 7T fe p(-K k \d,M,I), (45) 



which substituting jUjl gives 

- r , +1 ^ f) + M- M f ^ (i - « k ) N - n * +M ^ ■ (46) 

r(n fc + - n k H — ) Jo 

The integral again gives a Beta function, which when written in terms of Gamma functions 
is 

r(Ar + f) r(n fc + |)r(jv-n fc + ^) 

w r(n fc + |)r(iv-n fc + ^i)' r(iv + f + i) ' 10 

Using the fact that r(a; + 1) = xr(a;) and cancelling like terms, we find that 

1 



2 

The mean probability density for bin k (the bin height) is simply 



v k \Vj\N+f 

It is an interesting result that bins with no counts still have a non-zero probability. This 
makes sense since no lack of evidence can ever prove conclusively that an event occurring in 
a given bin is impossible — just less probable. The Jeffrey's prior effectively places one-half 
of a datum point in each bin. The variance of the height of the k th bin is found similarly 

by 

4={y)\(4)-(^) 2 ), (50) 
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which gives 

'M\ 2 f{n k + \){N -nu + M^l) 



V J V (N + M- + l)(N+f ; 



(51) 



Thus, given the optimal number of bins found by maximizing (HU, the mean and variance 
of the bin heights are found from (|49(l and l|51|l. which allow us to construct an explicit 
histogram model of the probability density and perform computations and error analysis. 
Note that in the case where there is one bin (|51|l gives a zero variance. 



4. Results 

4. 1. Demonstration using One-dimensional Histograms 

In this section we demonstrate the utility of this method for determining the optimal number 
of bins in a histogram model of several different data sets. Here we consider 1000 data points 
sampled from four different probability density functions. The optimal histogram for 1000 
data points drawn from a Gaussian distribution A^O, 1) is shown in Figure 1A, where it is 
superimposed over a 100-bin histogram showing the density of the sampled points. Figure 
IB shows that the logarithm of the posterior probability l|3tjfl peaks at 14 bins. Figure 1C 
shows the optimal binning for data sampled from a 4-step piecewise-constant density. The 
logarithm of the posterior (Figure ID) peaks at 4 bins, which indicates that the algorithm 
can correctly detect the 4-step structure. In figures IE and F, we see that samples drawn 
from a uniform density were best described by a single bin. This result is significant, 
since entropy estimates computed from these data would be biased if multiple bins were 
used. Last, we consider a density function that consists of a mixture of three sharply-peaked 
Gaussians with a uniform background (Figure 1G). The posterior peaks at 52 bins indicating 
that the data warrant a detailed model (Figure 1H). The spikes in the log posterior are due 
to the fact that the bin edges are fixed. The log posterior is large at values of M where the 
bins happen to line up with the Gaussians, and small when they are misaligned. This last 
example demonstrates one of the weaknesses of the equal bin-width model, as many bins 
are needed to describe the uniform density between the three narrow peaks. 



4.2. Comparison to Other Techniques 

We now compare our results with those obtained using Scott's Rule, Stone's Rule, and the 
Akaike model selection criterion (Akai ke (T974(0 . Since Freedman and Diaconis' (F&D) 
method has the same functional form as Scott's Rule, the results using F&D are not pre- 
sented here. Their technique leads to a greater number of bins than Scott's Rule, which, 
as we will show, already prescri bes more bins th an are warranted by the data. Akaike's 
method, applied to histograms in lHartiera n (1996), balances the logarithm of the likelihood 
of the model against the number of model parameters. The number of bins is chosen to 
maximize 

AIC(M) = logp(d n \n k ,M, I) - M. (52) 

Since Scott's Rule was derived to be asymptotically optimal for Gaussian distributions, 
we limited our comparison to Gaussian-distributed data. We tested data sets with 12 
different numbers of samples including N = {200, 500, 1000, 2000, • • • , 1000000}. For each N 
we tested 50 different histograms, each with N samples drawn from a Gaussian distribution 
A/"(0, 1), for a total of 700 histograms in this analysis. The optimal number of bins was 
found using Scott's Rule ©, Stone's Rule Q, Akaike's AIC (JE3) and the present optBINS 




Fig. 1 . To demonstrate the technique, 1 000 samples were drawn from four different probability den- 
sity functions (pdf) above. (A) The optimal histogram for 1000 samples drawn from a Gaussian 
density function is superimposed over a 100-bin histogram that shows the distribution of data sam- 
ples. (B) The log posterior probability of the number of bins peaks at 14 bins for these 1000 data 
sampled from the Gaussian density. (C) Samples shown are from a 4-step piecewise-constant den- 
sity function. The optimal binning describes this density accurately since (D) the log posterior peaks 
at four bins. (E) These data were sampled from a uniform density as verified by the log posterior 
probability (F). (G) shows a more complex example — three Gaussian peaks plus a uniform back- 
ground. (H) The posterior, which peaks at 52 bins, demonstrates clearly that the data themselves 
support this detailed picture of the pdf. 
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2 4 6 8 lOxio 5 2 4 6 8 10 xio 5 

Number of Data Points (N) Number of Data Points (N) 



Fig. 2. Four techniques are compared: optBiNS (solid), AIC (dot-dash), Stone (dash), and Scott 
(dot). For each number of data points N, 50 separate sets of samples were drawn from a Gaussian 
distribution 7V(0, 1). With each set of samples, each method was used to determine the optimal 
number of bins M and the Integrated Square Error (ISE) between each of the 50 modelled density 
functions and the original Gaussian density. Mean and standard deviations of these results were 
computed for the 50 samples. (A) This pair of figures (top: inset, bottom: complete figure) shows 
the mean and standard deviation of the number of bins M determined by each algorithm. For large 
numbers of data points N > 5000, optBiNS consistently chooses a smaller number of bins than 
the other techniques. (B) The figures on the right compare the ISE between the modelled density 
function and the true density function. The present algorithm optBiNS consistently outperforms the 
other methods, despite the fact that Scott's Rule is derived to asymptotically optimize the ISE when 
the sampling distribution is a Gaussian. 
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algorithm The quality of fit was quantified using the Integrated Square Error (ISE), 

which is the criterion for which Scott's Rule is asymptotically optimized. This is 

ISE = J dx(h(x,M)-^=exp(- — )\, (53) 

where h{x,M) is the piecewise-constant histogram model with M bins. 

Figure shows the mean number of bins M found using each of four methods for the 
values of the number of samples N tested. The present algorithm optBINS tends to choose 
the least number of bins, and does so consistently for N > 5000. More importantly, Figure 
03 shows the mean ISE between the modelled density function and the correct Gaussian 
distribution Af(0, 1). Since Scott's Rule was derived to asymptotically optimize the ISE, it 
may be surprising to see that optBINS outperforms each of these methods with respect to 
this measure. While Scott's Rule may be optimal as N — > oo, optBINS, which is based on 
a Bayesian solution, is optimal for finite N. Furthermore if a log probability (or log-odds 
ratio) criterion were to be used, optBINS would be optimal by design. 



5. Effects of Small Sample Size 

5. 1 . Small Samples and Asymptotic Behavior 

It is instructive to observe how this algorithm behaves in situations involving small sample 
sizes. We begin by considering the extreme case of two data points N = 2. In the case of a 
single bin, M = 1, the posterior probability reduces to 

p(M = l\d u d 2 ,I) cx M -— ^ t{n+ m ) 

r(i) r(2 + i) , , 



ri 



2 / 



2/ 



so that the log posterior is zero. For M > 1, the two data points lie in separate bins, 
resulting in 



p(M\d 1 ,d 2 ,I) cx M 



N 



r(f) + 



v 2 

- T( 

oc M 



r(i) M r(N + f) 



2 r(M) r(i + ^r(ir 
T (i) M r(2 + f) 



,2 

' M 

, i. i — i i \ 

cx M 



9 r(|)» r(f) 



v{iy r(2 + f) 



,2 

1 M 



2 l + 4f 



(55) 



Figure|2K shows the log posterior which starts at zero for a single bin, drops to log(i) for 
M = 2 and then increases monotonically approaching zero in the limit as M goes to infinity. 
The result is that a single bin is the most probable solution for two data points. 
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Fig. 3. These figures demonstrate the behavior of the relative log posterior for small numbers of 
samples. (A) With only N = 2 samples, the log posterior is maximum when the samples are in the 
same bin M = 1. For M > 1, the log posterior follows the function described in 1551 in the text. (B) 
The relative log posterior is slightly more complicated for N = 3. For M = 1 all three points lie in 
the same bin. As M increases, two data points are in one bin and the remaining datum point is in 
another bin. The functional form is described by 1551 . Eventually, all three data points lie in separate 
bins and the relative log posterior is given by |57J. (C) The situation is more complicated still for 
N = 5 data points. As M increases, a point is reached when, depending on the particular value 
of M, the points will be in separate bins. As M changes value, two points may again fall into the 
same bin. This gives rise to this oscillation in the log posterior. Once all points are in separate bins, 
the behavior follows a well-defined functional form 1581 . (D) This plot shows the behavior for a large 
number of data points N = 200. The log posterior now displays a more well-defined mode indicating 
that there is a well-defined optimal number of bins. As M approaches 10000 to 100000 bins, one 
can see some of the oscillatory behavior demonstrated in the small N cases. 
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For three data points in a single bin (TV = 3 and M = 1), the posterior probability is 
one, resulting in a log posterior of zero. In the M > 1 case where there are two data points 
in one bin and one datum point in another, the posterior probability is 

3 M 2 

p(M\di,d2,d 3 ,r)<x-- (2+ M )(1 + M y ( 56 ) 
and for each point in a separate bin we have 

1 M 2 

p(M\d 1 ,d 2 ,d 3 ,I)cc-- (2+ m )(1+ m } - ( 57 ) 

While the logarithm of the posterior in l|56(l can be greater than zero, as M increases, the 
data points eventually fall into separate bins. This causes the posterior to change from 
15ti|) to (|57|) resulting in a dramatic decrease in the logarithm of the posterior, which then 
asymptotically increases to zero as M — > oo. This behavior is shown in Figure [3)3. The 
result is that either one or two bins will be optimal depending on the relative positions of 
the data points. 

More rich behavior can be seen in the case of N = 5 data points. The results again 
(Figure|3p) depend on the relative positions of the data points with respect to one another. 
In this case the posterior probability switches between two types of behavior as the number 
of bins increase depending on whether the bin positions force two data points together in 
the same bin or separate them into two bins. The ultimate result is a ridiculous maximum a 
posteriori solution of 57 bins. Clearly, for a small number of data points, the mode depends 
sensitively on the relative positions of the samples in a way that is not meaningful. In these 
cases there are too few data points to model a density function. 

With a larger number of samples, the posterior probability shows a well-defined mode 
indicating a well-determined optimal number of bins. In the general case of M > N where 
each of the N data points is in a separate bin, we have 

/ M \ N r(M-) 

p(M|4i)«( T J j^^y, (58) 

which again results in a log posterior that asymptotically approaches zero as M — » oo. 
Figure demonstrates these two effects for N — 200. This also can be compared to the 
log posterior for 1000 Gaussian samples in Figure 



5.2. Sufficient Data 

This investigation on the effects of small sample size raises the question as to how many 
data points are needed to estimate the probability density function. The general shape of a 
healthy log posterior reflects a sharp initial rise to a well-defined peak, and a gradual fall-off 
as the number of bins M increases from one (eg. Fig. ^3, Fig. With small sample sizes, 
however, one finds that the bin heights have large error bars (Figure so that fa ~ <7j, 
and that the log posterior is multi-modal (Figure^) with no clear peak. 

We tested our algorithm on data sets with 199 different sample sizes from N = 2 to 
N = 200. One thousand data sets were drawn from a Gaussian distribution for each value 
of N. The standard deviation of the number of bins obtained for these 1000 data sets at a 
given value if N was used as an indicator of the stability of the solution. 
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Fig. 4. (A) An optimal density model (M = 19) for N = 30 data points sampled from a Gaussian 
distribution. The fact that the error bars on the bin probabilities are as large as the probabilities 
themselves indicates that this is a poor estimate. (B) The log posterior probability for the number of 
bins possesses no well-defined peak, and is instead reminiscent of noise. (C) This plot shows the 
standard deviation of the estimated number of bins M for 1000 data sets of N points, ranging from 
2 to 200, sampled from a Gaussian distribution. The standard deviation stabilizes around a M = 2 
bins for TV > 150 indicating the inherent level of uncertainty in the problem. This suggests that one 
requires at least 150 data points to consistently perform such probability density estimates, and can 
perhaps get by with as few as 100 data points in some cases. 



Figure 0p shows a plot of the standard deviation of the number of bins selected for the 
1000 data sets at each value of N. As we found above, with two data points, the optimal 
solution is always one bin giving a standard deviation of zero. This increases dramatically 
as the number of data points increases, as we saw in our example with N — 5 and M = 57. 
This peaks around N — 15 and slowly decreases as N increases further. The standard 
deviation of the number of bins decreased to ctm < 5 for N > 100, and stabilized to ctm — 2 
for N > 150. 

While 30 samples may be sufficient for estimating the mean and variance of a density 
function known to be Gaussian, it is clear that more samples are needed to reliably estimate 
the shape of an unknown density function. In the case where the data are described by a 
Gaussian, it would appear that at least 100 samples, and preferentially 150 samples, are 
required to accurately and consistently infer the shape of the density function. By examining 
the shape of the log posterior, one can easily determine whether one has sufficient data to 
estimate the density function. In the event that there are too few samples to perform such 
estimates, one can either incorporate additional prior information or collect more data. 
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Fig. 5. N = 1000 data points were sampled from a Gaussian distribution Af(Q, 1). The top plots 
show (A) the estimated density function using optimal binning and (B) the relative log posterior, which 
exhibits a well-defined peak at M = 11 bins. The bottom plots reflect the results using the same data 
set after it has been rounded with Ax = 0.1 to keep only the first decimal place. (C) There is no 
optimal binning as the algorithm identifies the discrete structure as being a more salient feature than 
the overall Gaussian shape of the density function. (D) The relative log posterior displays no well- 
defined peak, and in addition, for large numbers of M displays a monotonically increase given by 
1551 that asymptotes to a positive value. This indicates that the data have been severely rounded. 



6. Digitized Data 

Due to t he way that computers repres ent data, all data are essentially represented by 
integers l|Bavnian and Broadhurst. 1979j) . In some cases, the data samples have been in- 
tentionally rounded or truncated, often to save storage space or transmission time. It is 
well-known that any non-invertible transformation, such as rounding, destroys information. 
Here we investigate how severe losses of information due to rounding or truncation affects 
the optBINS algorithm. 

When data are digitized via truncation or rounding, the digitization is performed so as 
to maintain a resolution that we will denote by Ax. That is, if the data set has values that 
range from to 1, and we represent these numbers with an 8 bits, the minimum resolution 
we can maintain is Ax = 1/2 8 = 1/256. For a sufficiently large data set (in this example 
N > 256) it will be impossible for every datum point to be in its own bin when the number 
of bins is greater than a critical number, M > Ma x , where 



M Ax = — , 
Ax 



(59) 
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and V is the range of the data considered. Once M > M/\ x the number of populated bins 
P will remain unchanged since the bin width w for M > Ma x wn l be smaller than the 
digitization resolution, w < Ax. 

For all bin numbers M > M& x , there will be P populated bins with populations 
ni,ri2, ■ ■ ■ , np. This leads to a form for the marginal posterior probability for M l|35|) 
that depends only on the number of instances of each discrete value that was recorded, 
m, Ji2, . . • , np. Since these values do not vary for M > M& x , the marginal posterior can be 
viewed solely as a function of M 



m\ n r(f) u^vj^ + i) 



where the product over p is over populated bins only. Comparing this to (|58|l , the function 
on the right-hand side asymptotically approaches a value greater than one — so that its 
logarithm increases asymptotically to a value greater than zero. 

As the number of bins M increases, the point is reached where the data can not be 
further separated; call this point M cr i t . In this situation, there are n p data points in the 
p th bin and the posterior probability can be written as 

/M\ N T(^) P 
p(M\d,I)*{-) _^U^.JJ(2n p -l)!!, (61) 

where !! denotes the double factorial. For M > M cr n, as M — > oo, the log posterior 
asymptotes to ^ j log((2n p — 1)!!), which can be further simplified to 

p p 2n„-l 

]Tlog((2n p -l)!!) = (P-A01og(2) + ]r £ logs. (62) 

p— 1 p— 1 s—rip 

To test for excessive rounding or truncation, the mode of log p(M\d, I) for M < M cr i t 
should be compared to (|62[) above. If the latter is larger, than the discrete nature of the data 
is a more significant feature than the general shape of the underlying probability density 
function. When this is the case, a reasonable histogram model of the density function can 
still be obtained by adding a uniformly-di stributed random number, with a range defined 
by the resolution Ax, to each datum point IjBavman and Broadhurst. 1979^) . While this will 
produce the best histogram possible given the data, this will not recover the lost information. 



7. Multi-Dimensional Histograms 

In this section, we demonstrate that our method can be extended naturally to multi- 
dimensional histograms. We begin by describing the method for a two-dimensional his- 
togram. The constant-piecewise model h(x, y) of the two-dimensional density function 
f(x,y) is 

M M * My 

h(x,y;M x ,M y ) = — ^^7^ K( x 3-i> x > ^^(Vk-t, V, Vk), (63) 

3=1 k=l 

where M — M x M y , V is the total area of the histogram, j indexes the bin labels along x, 
and k indexes them along y. Since the Wj^ all sum to unity, we have M — 1 bin probability 
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Fig. 6. 10000 samples were drawn from a two-dimensional Gaussian density to demonstrate the 
optimization of a two-dimensional histogram. (A) The relative logarithm of the posterior probability is 
plotted as a function of the number of bins in each dimension. The normalization constant has been 
neglected in this plot, resulting in positive values of the log posterior. (B) This plot shows the relative 
log posterior as a contour plot. The optimal number of bins is found to be 12 x 14. (C) The optimal 
histogram for this data set. (D) The histogram determined using Stone's method has 27 x 28 bins. 
This histogram is clearly sub-optimal since it highlights random variations that are not representative 
of the density function from which the data were sampled. 
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density parameters as before, where M is the total number of bins. The likelihood of 
obtaining a datum point d n from bin (j, k) is still simply 

M 

p(d n \iTj,k,M x ,My,I) = yTTj.fe. (64) 
The previous prior assignments result in the posterior probability 

N p/M\ M x My 

v) r 



p(n,M Xl M y \d,I)^(y) nil^f"*' ( 65 ) 

2} 3 = 1 



where 7r m x .m v is 1 minus the sum of all the other bin probabilities. The order of the 
bins in the marginalization does not matter, which gives a result similar in form to the 
one-dimensional case 

M\ N r(f) H/Ml^n^ + 1) 

V) r (i)« T(N+f) 



p(M x ,M v \d,I)<x(-) 3=1 't' M ; (66) 



where M = M x M y . 



For a D-dimensional histogram, the general result is 

iv r /M\ tt m i rr M D 



M\"r(¥) n"ii---n,"ir(n, ,„ + ±) 



where Mi is the number of bins along the i th dimension, M is the total number of bins, V 
is the _D-dimensional volume of the histogram, and ra, ll ...^ D indicates the number of counts 
in the bin indexed by the coordinates (ii, . . . , io). Note that the result in (|36|l can be used 
directly for a multi-dimensional histogram simply by relabelling the multi-dimensional bins 
with a single index. 

Figure HJI demonstrates the procedure on a data set sampled from a two-dimensional 
Gaussian. In this example, 10000 samples were drawn from a two-dimensional Gaussian 
density. Figure shows the relative logarithm of the posterior probability plotted as a 
function of the number of bins in each dimension. The same surface is displayed as contour 
plot in Figure EJ3, where we find the optimal number of bins to be 12 x 14. Figure 
shows the optimal two-dimensional histogram model. Note that modelled density function 
is displayed in terms of the number of counts rather than the probability density, which can 
be easily computed using with error bars computed using l|51|) . In Figure 03, we show 
the histogram obtained using Stone's method, which results in a 27 x 28 array of bins. This 
is clearly a sub-optimal model since random sampling variations are easily visible. 



8. Discussion 



The optimal binning algorithm presented in this paper, optBINS, relies on finding the mode 
of the marginal posterior probability of the number of bins in a piecewise-constant density 
function. This posterior probability originates as a product of the likelihood of the density 
parameters given the data and the prior probability of those same parameter values. As 
the number of bins increases, the model can better fit the data, which leads to an increase 
in the likelihood function. However, by introducing additional bins, the prior probability, 
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which must sum to unity, is now spread out over a parameter space of greater dimensionality 
resulting in a decrease of the prior probability. Thus a density model with more bins will 
result in a greater likelihood, but also in a smaller prior probability. Since the posterior 
is a product of these two functions, the maximum of the posterior probability occurs at a 
point where these two opposing factors are balanced. This interplay between the likelihood 
and the prior probability effectively implements Occam's razor by selecting the most simple 
model that best describes the data. 

The quality of this algorithm was demonstrated by testing it against several other pop- 
ular bin selection techniques. The optBINS algorithm outperformed all techniques inves- 
tigated in the sense that it consistently resulted in the minimum integrated square error 
between the estimated density model and the true density function. This was true even in 
the case of Scott's technique, which is designed to asymptotically minimize this error for 
Gaussian-distributed data. Our algorithm also can be readily applied to multi-dimensional 
data sets, which we demonstrated with a two-dimensional data set. In practice, we have 
been applying optBINS to three-dimensional data sets with comparable results. 

It should be noted that we are working with a piecewise-constant model of the density 
function, and not a histogram. The distinction is subtle, but important. Given the full pos- 
terior probability for the model parameters and a selected number of bins, one can estimate 
the mean bin probabilities and their associated standard deviations. This is extremely use- 
ful in that it quantifies uncertainties in the density model, which can be used in subsequent 
calculations. In this paper, we demonstrated that with small numbers of data points the 
magnitude of the error bars on the bin heights is on the order of the bin heights themselves. 
Such a situation indicates that too few data exist to infer a density function. This can 
also be determined by examining the marginal posterior probability for the number of bins. 
In cases where there are too few data points, the posterior will not possess a well-defined 
mode. In our experiments with Gaussian-distributed data, we found that approximately 
150 data points are needed to accurately estimate the density model when the functional 
form of the density is unknown. This approach has an additional advantage in that we 
can use the optBINS algorithm to identify data sets wher e the data have bee n excessively 
truncated. This will be explored further in future papers jKnuth et al.. 200fih . 

As always, there is room for improvement. The optBINS algorithm, as currently imple- 
mented, performs a brute force search of the number of bins. This can be slow for large data 
sets that require large numbers of bins, or multi-dimensional data sets that have multiple 
bin dimensions. We have also made some simplifying assumptions in designing the algo- 
rithm. First, the endpoints of the density model are defined by the data, and are not allowed 
to vary during the analysis. Second, we use the marginal posterior to select the optimal 
number of bins and then use this value to estimate the mean bin heights. This neglects the 
fact that we actually possess uncertainty about the number of bins. Thus the uncertainty 
in the number of bins is not quantified. Last, equally-spaced bins can be very inefficient in 
describing multi-modal density func tions (as in Fi g. [Bj.) In such cases, variable bin-width 
models such as Bayesian Blocks \ Jackson et al.. 2005[) or the Markov chain Monte Carlo 
implementations that we have experimented with may prove to be more useful. 

For most applications, optBINS efficiently delivers an appropriate number of bins that 
both maximizes the depiction of the shape of the density function while minimizing the 
appearance of random fluctuations. 
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Appendix 1 : Matlab code for the optimal number of uniform bins in a histogram 

7. optBINS computes the optimal number of bins for a given one-dimensional 
7, data set. This optimization is based on the posterior probability for 
7» the number of bins 

% 

7. Usage: 

7. optM = optBINS(data,minM,maxM) ; 

7. 

7. Where: 

7» data is a (1,N) vector of data points 

7» minM is the minimum number of bins to consider 

7. maxM is the maximum number of bins to consider 

7. 

% This algorithm uses a brute-force search trying every possible bin number 

7. in the given range. This can of course be improved. 

7. Generalization to multidimensional data sets is straightforward. 

7. 

7. Created by Kevin H. Knuth on 17 April 2003 
7. Modified by Kevin H. Knuth on 15 March 2006 

function optM = optBINS(data,minM,maxM) 

if size(data)>2 I size (data, 1) >1 

error ('data dimensions must be (1,N)'); 

end 

N = size(data,2) ; 

7. Simply loop through the different numbers of bins 
7. and compute the posterior probability for each, 
logp = zeros ( 1, maxM) ; 
for M = minM:maxM 

n = hist (data, M) ; % Bin the data (equal width bins here) 

parti = N*log(M) + gammaln(M/2) - gammaln(N+M/2) ; 

part2 = - M*gammaln(l/2) + sum(gammaln(n+0 . 5) ) ; 

logp(M) = parti + part2; 

end 

[maximum, optM] = max(logp) ; 
return 
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